Adaptive speech recognition method with noise compensation

ABSTRACT

An adaptive speech recognition method with noise compensation is disclosed. In speech recognition, optimal equalization factors for feature vectors of a plurality of speech frames corresponding to each probability density function in a speech model are determined based on the plurality of speech frames of the input speech and the speech model. The parameters of the speech model are adapted by the optimal equalization factor and a bias compensation vector that corresponds to, and is retrieved by, the optimal equalization factor. The optimal equalization factor is provided to adjust a distance of the mean vector in the speech model. The bias compensation vector is provided to adjust a direction change of the mean vector in the speech model.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the field of speech recognition and, more particularly, to an adaptive speech recognition method with noise compensation.

2. Description of Related Art

Robustness is undoubtedly crucial in pattern recognition because, in real-world applications, a mismatch between training and testing data can severely degrade recognition performance. For speech recognition, the mismatch comes from inter- and intra-speaker variability, transducers/channels and surrounding noises. For instance, in a hands-free voice interface in a car environment, the non-stationary surrounding noises of engine, music, babble, wind and echo vary with driving speed and hence deteriorate the performance of the recognizer.

To solve the problem, a direct method is to collect enough training data from various noise conditions to generate speech models, such that proper speech models can be selected based on the environment of a specific application. However, such a method is impractical for a car environment because of the complexity of the noise and the tremendous amount of training data to be collected. In addition, the method requires an additional mechanism to detect changes in the environment, and such an environmental detector is difficult to design.

Alternatively, a feasible approach is to build an adaptive speech recognizer in which the speech models can be adapted to new environments using environment-specific adaptation data.

In the context of statistical speech recognition, the optimal word sequence Ŵ of an input utterance X={x_(t)} is determined according to the Bayes rule:

$\begin{matrix}{\hat{W} = {\underset{W}{\arg\max}\; p(W|X)} = {\underset{W}{\arg\max}\; p(X|W)\,p(W)},} & (1)\end{matrix}$

where p(X|W) is the occurrence probability of X when the word sequence of X is W, and p(W) is the occurrence probability of the word sequence W based on prior knowledge of word sequences. A description of this technique can be found in RABINER, L. R.: ‘A tutorial on hidden Markov models and selected applications in speech recognition’, Proceedings of IEEE, 1989, vol. 77, pp. 257-286, which is incorporated herein for reference. Using a Markov chain to describe the change of the speech features, p(X|W) can be further expressed, based on HMM (Hidden Markov Model) theory, as follows: $\begin{matrix}{p(X|W) = \sum\limits_{\text{all } S} p(X,S|W) = \sum\limits_{\text{all } S} p(X|S,W)\,p(S|W),} & (2)\end{matrix}$

where S is the state sequence of the speech signal X.

In general, the computations of (1) and (2) are very expensive and practically infeasible because all possible S must be considered. One efficient approach is to apply the Viterbi algorithm and decode the optimal state sequence Ŝ={ŝ_(t)}, as described in VITERBI, A. J.: ‘Error bounds for convolutional codes and an asymptotically optimum decoding algorithm’, IEEE Trans. Information Theory, 1967, vol. IT-13, pp. 260-269, which is incorporated herein for reference. As such, the summation over all possible state sequences in (2) is approximated by the single most likely state sequence, i.e. $\begin{matrix}{p(X|W) \cong p(X|\hat{S},W)\,p(\hat{S}|W) = \pi_{\hat{s}_0} \prod\limits_{t=1}^{T} a_{\hat{s}_{t-1}\hat{s}_t}\, b_{\hat{s}_t}(x_t),} & (3)\end{matrix}$

where π_(ŝ_0) is the initial state probability, a_(ŝ_(t−1) ŝ_(t)) is the state transition probability and b_(ŝ_(t))(x_(t)) is the observation probability density function of x_(t) in state ŝ_(t), which is modeled by a mixture of multivariate Gaussian densities; that is: $\begin{matrix}{b_{\hat{s}_t}(x_t) = p(x_t|\hat{s}_t = i, W) = \sum\limits_{k=1}^{K} \omega_{ik}\, f(x_t|\theta_{ik}) = \sum\limits_{k=1}^{K} \omega_{ik}\, N(x_t|\mu_{ik},\Sigma_{ik}).} & (4)\end{matrix}$

Herein, ω_(ik) is the mixture weight, and μ_(ik) and Σ_(ik) are respectively the mean vector and covariance matrix of the k-th mixture density function for the state ŝ_(t)=i. The occurrence probability f(x_(t)|θ_(ik)) of frame x_(t) associated with the density function θ_(ik)=(μ_(ik),Σ_(ik)) is expressed by:

$\begin{matrix}{f(x_t|\theta_{ik}) = (2\pi)^{-D/2}\,|\Sigma_{ik}|^{-1/2} \exp\left[-\tfrac{1}{2}(x_t-\mu_{ik})'\,\Sigma_{ik}^{-1}(x_t-\mu_{ik})\right].} & (5)\end{matrix}$
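Equations (4) and (5) can be evaluated as sketched below. This is only an illustration: it assumes a diagonal covariance matrix (stored as a list of variances), which is a common simplification but is not required by the text, and the function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Logarithm of (5), assuming Sigma_ik = diag(var) (an assumption)."""
    dim = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (dim * math.log(2.0 * math.pi) + log_det + maha)


def log_mixture(x, weights, means, variances):
    """Logarithm of b_i(x_t) in (4), computed with log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

Working in the log domain avoids numerical underflow when many frame probabilities are multiplied, as in (3).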

A speech recognition system is constructed by determining the HMM parameters, such as {μ_(ik),Σ_(ik)}, {ω_(ik)} and {a_(ij)}. The system then operates by using the Viterbi algorithm to determine the optimal word sequence for the input speech. However, surrounding noises cause a mismatch between the speech features of the application environment and those of the training environment. As a result, the established HMMs cannot correctly describe the input speech, and the recognition rate decreases. In the car environment in particular, the noises are so adverse that the performance of the speech recognition system is much lower than in a clean environment. Therefore, in order to implement, for example, a human-machine voice interface in car environments, an adaptive speech recognition method with noise compensation is desired, so as to improve the recognition rate.
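As a minimal sketch of the difference between the exhaustive sum in (2) and the single-path approximation in (3), the following Python example uses a toy HMM. The discrete emission table `b`, the function names and the numeric values are illustrative assumptions; the system described here uses Gaussian mixture densities instead.

```python
import itertools


def path_probability(pi, a, b, obs, path):
    """pi[s0] * prod_t a[s_{t-1}][s_t] * b[s_t][x_t] for path = (s0, ..., sT)."""
    p = pi[path[0]]
    for t, x in enumerate(obs, start=1):
        p *= a[path[t - 1]][path[t]] * b[path[t]][x]
    return p


def full_likelihood(pi, a, b, obs):
    """Equation (2): sum over every state sequence (exponential cost)."""
    n = len(pi)
    return sum(
        path_probability(pi, a, b, obs, path)
        for path in itertools.product(range(n), repeat=len(obs) + 1)
    )


def viterbi(pi, a, b, obs):
    """Equation (3): probability and states of the single best path."""
    n = len(pi)
    delta = list(pi)  # delta[s]: best probability of a partial path ending in s
    back = []         # back[t][s]: best predecessor of state s at step t + 1
    for x in obs:
        prev = delta
        choices = [
            max(range(n), key=lambda r, s=s: prev[r] * a[r][s]) for s in range(n)
        ]
        delta = [prev[choices[s]] * a[choices[s]][s] * b[s][x] for s in range(n)]
        back.append(choices)
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for choices in reversed(back):
        path.append(choices[path[-1]])
    path.reverse()
    return delta[last], path


# Toy two-state model with two observation symbols (hypothetical values).
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1]
best_probability, best_path = viterbi(pi, a, b, obs)
```

The exhaustive sum visits every one of the N^(T+1) state sequences, whereas the Viterbi recursion needs on the order of N²·T operations. That difference is why (3) is used in practice.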

Moreover, Mansour and Juang observed that additive white noise causes norm shrinkage of the speech cepstral vector; a description can be found in MANSOUR, D. and JUANG, B.-H.: ‘A family of distortion measures based upon projection operation for robust speech recognition’, IEEE Trans. Acoustics, Speech, Signal Processing, 1989, vol. 37, pp. 1659-1671, which is incorporated herein for reference. They consequently designed a distance measure in which a scaling factor was introduced to compensate for the cepstral shrinkage in cepstrum-based speech recognition. This approach was further extended to the adaptation of HMM parameters by detecting an equalization scalar λ between the probability density function unit θ_(ik) and the noisy speech frame x_(t), as described in CARLSON, B. A. and CLEMENTS, M. A.: ‘A projection-based likelihood measure for speech recognition in noise’, IEEE Transactions on Speech and Audio Processing, 1994, vol. 2, no. 6, pp. 97-102, which is incorporated herein for reference. The probability measurement in (5) is modified to:

$\begin{matrix}{f(x_t|\lambda,\theta_{ik}) = (2\pi)^{-D/2}\,|\Sigma_{ik}|^{-1/2} \exp\left[-\tfrac{1}{2}(x_t-\lambda\mu_{ik})'\,\Sigma_{ik}^{-1}(x_t-\lambda\mu_{ik})\right].} & (6)\end{matrix}$

The optimal equalization factor λ_(e) is determined by directly maximizing the logarithm of (6) as follows: $\begin{matrix}{\lambda_e = \underset{\lambda}{\arg\max}\,\log f(x_t|\lambda,\theta_{ik}) = \frac{x_t'\,\Sigma_{ik}^{-1}\mu_{ik}}{\mu_{ik}'\,\Sigma_{ik}^{-1}\mu_{ik}}.} & (7)\end{matrix}$
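The closed form in (7) follows from a short derivation, sketched here under the assumption that Σ_(ik) is symmetric positive definite and μ_(ik) is nonzero. Taking the logarithm of (6) gives

$\log f(x_t|\lambda,\theta_{ik}) = -\tfrac{D}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_{ik}| - \tfrac{1}{2}(x_t-\lambda\mu_{ik})'\,\Sigma_{ik}^{-1}(x_t-\lambda\mu_{ik}),$

and setting its derivative with respect to λ to zero yields

$\frac{\partial}{\partial\lambda}\log f = \mu_{ik}'\,\Sigma_{ik}^{-1}(x_t-\lambda\mu_{ik}) = 0 \;\Rightarrow\; \lambda_e = \frac{\mu_{ik}'\,\Sigma_{ik}^{-1}x_t}{\mu_{ik}'\,\Sigma_{ik}^{-1}\mu_{ik}} = \frac{x_t'\,\Sigma_{ik}^{-1}\mu_{ik}}{\mu_{ik}'\,\Sigma_{ik}^{-1}\mu_{ik}}.$

The last equality uses the symmetry of Σ_(ik)⁻¹. The second derivative, $-\mu_{ik}'\,\Sigma_{ik}^{-1}\mu_{ik}$, is negative, so the stationary point is a maximum.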

Geometrically, this factor is equivalent to the projection of x_(t) upon μ_(ik) weighted by Σ_(ik)⁻¹. Using λ_(e) to compensate for the influence of white noise has proved helpful in increasing the speech recognition rate. However, for speech recognition in car environments, the surrounding noise is non-white and difficult to characterize. It is thus insufficient to adapt the HMM mean vector μ_(ik) by applying only the optimal equalization scalar λ_(e). Therefore, there is a need for the above speech recognition method to be improved.
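The projection interpretation can be checked numerically. The sketch below computes the factor of (7) under the assumption of a diagonal covariance matrix; the function name and the sample vectors are hypothetical.

```python
def equalization_factor(x, mean, var):
    """Optimal equalization factor of (7), assuming Sigma_ik = diag(var)."""
    num = sum(xi * mi / vi for xi, mi, vi in zip(x, mean, var))
    den = sum(mi * mi / vi for mi, vi in zip(mean, var))
    return num / den


mean = [2.0, -1.0, 0.5]
var = [1.0, 4.0, 0.25]

# A frame that is a pure shrunken copy of the mean recovers the shrink factor.
shrunken = [0.8 * m for m in mean]
lam_shrunken = equalization_factor(shrunken, mean, var)

# For a general frame, the residual x - lam * mean is orthogonal to the mean
# under the inverse-covariance weighting, as expected of a projection.
x = [1.0, 2.0, 3.0]
lam = equalization_factor(x, mean, var)
residual = [xi - lam * mi for xi, mi in zip(x, mean)]
weighted_inner = sum(r * m / v for r, m, v in zip(residual, mean, var))
```

The nonzero residual in the second case is the part of the mismatch that a scalar cannot absorb, which is the limitation noted above for non-white noise.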

SUMMARY OF THE INVENTION

The object of the present invention is to provide an adaptive speech recognition method with noise compensation for effectively improving the speech recognition rate in a noisy environment.

To achieve the object, the adaptive speech recognition method with noise compensation in accordance with the present invention is capable of compensating noises of an input speech by adjusting parameters of an HMM speech model. The method includes the following steps: (A) determining, based on the plurality of speech frames of the input speech and the speech model, optimal equalization factors for feature vectors of the plurality of speech frames corresponding to each probability density function in the speech model; and (B) adapting the parameters of the speech model by the optimal equalization factor and a bias compensation vector corresponding to and retrieved by the optimal equalization factor, wherein the optimal equalization factor is provided to adjust a distance of the mean vector in the speech model, and the bias compensation vector is provided to adjust a direction change of the mean vector in the speech model.

Other objects, advantages, and novel features of the invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is the flow chart of the adaptive speech recognition method with noise compensation in accordance with the present invention;

FIG. 2 is a flow chart for establishing a reference function table in accordance with the present invention;

FIG. 3 is a scatter diagram plotted by the process of establishing the reference function table; and

FIG. 4 is an exemplary reference function table established according to FIG. 3.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

With reference to FIG. 1, there is shown a preferred embodiment of the adaptive speech recognition method with noise compensation in accordance with the present invention. The method utilizes a speech model 11 and a reference function table 12 to perform a recognition process on the input speech, which includes a plurality of speech frames, so as to output the recognition result as an optimal word sequence Ŵ.

As shown in FIG. 1, a feature analysis process is first applied to the input speech frame (step S11), so as to generate the corresponding feature vector x_(t) for output. In step S12, for the input feature vector x_(t) and the corresponding parameters θ_(ik)=(μ_(ik), Σ_(ik)) of the speech model 11, the respective optimal equalization factor is determined by $\lambda_e = \underset{\lambda}{\arg\max}\,\log f(x_t|\lambda,\theta_{ik}) = \frac{x_t'\,\Sigma_{ik}^{-1}\mu_{ik}}{\mu_{ik}'\,\Sigma_{ik}^{-1}\mu_{ik}},$ where the speech model 11 is an HMM model established in a clean environment.

The optimal equalization factor λ_(e) obtained in step S12 is equivalent to the projection of x_(t) upon Σ_(ik)⁻¹μ_(ik), which is used to adjust the distance of the mean vector of the speech model 11. In step S13, the optimal equalization factor λ_(e) is used as a relational reference index to retrieve the corresponding bias compensation vector b(λ_(e)) from the reference function table 12 for further adjusting the direction change of the mean vector of the speech model 11, thereby removing the projection bias.

FIG. 2 illustrates an exemplary estimation procedure to establish the reference function table 12. The procedure first collects a small set of noisy training speech data 21 and performs a feature analysis on the training speech data (step S21), so as to generate feature vectors X={x_(t)} for output. In step S22, a calculation is undertaken based on the HMM speech model 11 and the feature vectors X={x_(t)} to determine the optimal equalization factors {λ_(e)} for pairs of x_(t) and all probability density functions {θ_(ik)}={(μ_(ik), Σ_(ik))} in the HMM speech model 11. Next, the corresponding adaptation bias vectors {x_(t)−λ_(e)μ_(ik)} are calculated (step S23).

These pairs of optimal equalization factors {λ_(e)} and bias vectors {x_(t)−λ_(e)μ_(ik)} are then plotted in a scatter diagram, as shown in FIG. 3. Based on the relation between the optimal equalization factor λ_(e) and the bias vector x_(t)−λ_(e)μ_(ik) expressed by the scatter diagram, the bias compensation vectors b(λ_(e)) corresponding to the optimal equalization factors {λ_(e)} are estimated piecewise by averaging the scattered values {x_(t)−λ_(e)μ_(ik)}, where the step size of λ_(e) is, for example, 0.01 (step S24). As such, the reference function table 12 is established, in which the bias compensation vectors b(λ_(e)) can be retrieved by using the optimal equalization factor λ_(e) as an index. An exemplary reference function table established according to FIG. 3 is illustrated in FIG. 4.
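Steps S22 to S24 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment. It assumes diagonal covariances, bins λ_(e) by rounding to the nearest multiple of the step size, and returns a zero vector for bins that received no data. None of these choices is specified in the text, and all function names are hypothetical.

```python
def equalization_factor(x, mean, var):
    """Optimal equalization factor of (7), assuming Sigma_ik = diag(var)."""
    num = sum(xi * mi / vi for xi, mi, vi in zip(x, mean, var))
    den = sum(mi * mi / vi for mi, vi in zip(mean, var))
    return num / den


def build_reference_table(frames, densities, step=0.01):
    """Average the bias vectors x_t - lam * mu_ik within each bin of lam.

    frames: list of feature vectors; densities: list of (mean, var) pairs.
    Returns a dict mapping the integer bin index to the averaged bias vector.
    """
    sums, counts = {}, {}
    for x in frames:                       # step S22: every frame ...
        for mean, var in densities:        # ... paired with every density
            lam = equalization_factor(x, mean, var)
            bias = [xi - lam * mi for xi, mi in zip(x, mean)]  # step S23
            key = round(lam / step)
            acc = sums.setdefault(key, [0.0] * len(x))
            for d, value in enumerate(bias):
                acc[d] += value
            counts[key] = counts.get(key, 0) + 1
    # Step S24: piecewise average per bin.
    return {key: [s / counts[key] for s in acc] for key, acc in sums.items()}


def lookup_bias(table, lam, dim, step=0.01):
    """Retrieve b(lam); empty bins fall back to a zero vector (an assumption)."""
    return table.get(round(lam / step), [0.0] * dim)
```

Because every frame is paired with every density, even a short adaptation utterance contributes many (λ_(e), bias) pairs to the table.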

With reference to FIG. 1 again, in step S14, based on the calculated optimal equalization factor λ_(e) and the retrieved bias compensation vector b(λ_(e)), the calculation of probability f(x_(t)|θ_(ik)) in the recognition process for input speech is adapted by λ_(e)μ_(ik)+b(λ_(e)). That is, the measurement of the probability turns out to be:

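$\begin{matrix}{f(x_t|\lambda_e, b(\lambda_e),\theta_{ik}) = (2\pi)^{-D/2}\,|\Sigma_{ik}|^{-1/2} \exp\left\{-\tfrac{1}{2}\left[x_t-\lambda_e\mu_{ik}-b(\lambda_e)\right]'\,\Sigma_{ik}^{-1}\left[x_t-\lambda_e\mu_{ik}-b(\lambda_e)\right]\right\}.} & (8)\end{matrix}$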
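The adapted measurement in (8) differs from (5) only in that the mean μ_(ik) is replaced by λ_(e)μ_(ik)+b(λ_(e)). A minimal log-domain sketch, again assuming a diagonal covariance and using a hypothetical function name:

```python
import math


def adapted_log_density(x, mean, var, lam, bias):
    """Logarithm of (8), assuming Sigma_ik = diag(var).

    The adapted mean is lam * mean + bias: lam rescales the mean vector and
    bias corrects its direction.
    """
    dim = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum(
        (xi - lam * mi - bi) ** 2 / vi
        for xi, mi, bi, vi in zip(x, mean, bias, var)
    )
    return -0.5 * (dim * math.log(2.0 * math.pi) + log_det + maha)
```

With `lam = 1` and a zero `bias`, the function reduces to the logarithm of the unadapted density (5).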

Because the bias compensation vector b(λ_(e)) is shared by all HMM units θ={θ_(ik)}, sufficient pairs of {x_(t)−λ_(e)μ_(ik)} and {λ_(e)} can be obtained with only a small set of adaptation speech data, thereby avoiding the known data sparseness problem. Applying the probability measurement of (8) to the Viterbi decoding algorithm (step S15), the optimal word sequence Ŵ associated with the input speech data X is generated for output.

In the above preferred embodiment, the reference function table 12 is established before a speech recognition process is carried out, and thus, when executing the speech recognition process, the reference function table 12 can be referenced and retrieved. Moreover, during the recognition process, the content of the reference function table 12 can be modified, by the method of establishing the table, based on the actual environment. Specifically, if the recognition rate is not satisfactory, for example less than 50%, the reference function table is modified online by using the actual speech samples as adaptation speech data, so that the bias compensation vectors b(λ_(e)) correctly reflect the actual environment noise, whereby the recognition rate can be further enhanced.
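One way the online modification could be organized is sketched below. The text states only that the table is modified by the same method used to establish it. Keeping running sums per bin, merging new samples into the existing averages, and the names `OnlineReferenceTable` and `adapt_if_needed` are all assumptions made for illustration.

```python
class OnlineReferenceTable:
    """Reference function table that keeps running sums so it can be updated."""

    def __init__(self, dim, step=0.01):
        self.dim = dim
        self.step = step
        self.sums = {}
        self.counts = {}

    def update(self, lam, bias):
        """Add one (lam, x_t - lam * mu_ik) pair to its bin."""
        key = round(lam / self.step)
        acc = self.sums.setdefault(key, [0.0] * self.dim)
        for d, value in enumerate(bias):
            acc[d] += value
        self.counts[key] = self.counts.get(key, 0) + 1

    def lookup(self, lam):
        """Return the averaged bias for lam, or zeros for an empty bin."""
        key = round(lam / self.step)
        if key not in self.counts:
            return [0.0] * self.dim
        return [s / self.counts[key] for s in self.sums[key]]


def adapt_if_needed(table, recognition_rate, samples, threshold=0.5):
    """Update the table only when the recognition rate falls below threshold.

    samples: iterable of (lam, bias) pairs computed from actual input speech.
    Returns True if the table was modified.
    """
    if recognition_rate >= threshold:
        return False
    for lam, bias in samples:
        table.update(lam, bias)
    return True
```

Merging into running averages keeps the previously estimated table as a starting point. Replacing the table outright with one built from the new samples would be an equally plausible reading of the text.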

In view of the foregoing, it is appreciated that the present invention provides an online adaptive speech recognition method in which, during the recognition process, both the distance and the direction of the mean vector of the probability density function in the speech model are adjusted. Therefore, the performance of a speech recognizer can be significantly improved.

Although the present invention has been explained in relation to its preferred embodiment, it is to be understood that many other possible modifications and variations can be made without departing from the spirit and scope of the invention as hereinafter claimed.

What is claimed is:
 1. An adaptive speech recognition method with noisecompensation capable of compensating noises of an input speech byadjusting parameters of a HMM (Hidden Markov Model) speech model, theinput speech having a plurality of speech frames, the method comprisingthe steps of: (A) determining, based on the plurality of speech framesof the input speech and the speech model, optimal equalization factorsfor feature vectors of the plurality of speech frames corresponding toeach probability density function in the speech model, wherein theoptimal equalization factor is determined based on the parametersθ_(ik)=(μ_(ik), Σ_(ik)) of the speech model, and is equivalent to aprojection of the speech frame upon Σ_(ik) ⁻¹μ_(ik), where μ_(ik) andΣ_(ik) are respectively the mean vector and covariance matrix of thek-th mixture density function for a state ŝ_(t)=i in the speech model;and (B) adapting the parameters of the speech model by the optimalequalization factor and a bias compensation vector corresponding to andretrieved by the optimal equalization factor, wherein the optimalequalization factor is provided to adjust a distance of the mean vectorin the speech model, and the bias compensation vector is provided toadjust a direction change of the mean vector in the speech model.
 2. Themethod as claimed in claim 1, wherein the bias compensation vector isobtained and stored in a reference function table based on noisy speechdata before executing a speech recognition process.
 3. The method asclaimed in claim 1, wherein, in step (B), the bias compensation vectoris retrieved from a reference function table by using a correspondingoptimal equalization factor as an index, so as to adjust the directionof the mean vector and remove projection bias.
 4. The method as claimedin claim 3, wherein, the reference function table is established by thesteps of: calculating the optimal equalization factors for pairs of eachspeech frame and all parameters in the speech model based on the speechmodel and the noisy speech data. calculating adaptation bias vectorscorresponding to the optimal equalization factors; and piecewiselyestimating, based on the relation between the optimal equalizationfactors and the adaptation bias vectors, the bias compensation illvectors by averaging the adaptation bias vectors.
 5. The method asclaimed in claim 4, wherein, the reference function table can bemodified on line by actual input speech in executing a recognitionprocess.
 6. The method as claimed in claim 1, wherein, in step (B),based on the determined optimal equalization factor λ_(e) and theretrieved bias compensation vector b(λ_(e)), a calculation ofprobability for speech recognition is adapted by λ_(e)μ_(ik)+b(λ_(e)),where μ_(ik) is the mean vector of the k-th mixture density function fora state ŝ_(t)=i in the speech model.
 7. The method as claimed in claim1, further comprising a step, before step (A), for performing a featureanalysis on the speech frames of the input speech.
 8. The method asclaimed in claim 1, further comprising a step, after step (A), forexecuting a Viterbi decoding algorithm.
 9. An adaptive speechrecognition method with noise compensation capable of compensatingnoises of an input speech by adjusting parameters of a HMM (HiddenMarkov Model) speech model, the input speech having a plurality ofspeech frames, the method comprising the steps of: (A) determining,based on the plurality of speech frames of the input speech and thespeech model, optimal equalization factors for feature vectors of theplurality of speech frames corresponding to each probability densityfunction in the speech model; and (B) adapting the parameters of thespeech model by the optimal equalization factor and a bias compensationvector corresponding to and retrieved by the optimal equalizationfactor, wherein the optimal equalization factor is provided to adjust adistance of the mean vector in the speech model, and the biascompensation vector is provided to adjust a direction change of the meanvector in the speech model, wherein the bias compensation vector isretrieved from a reference function table by using a correspondingoptimal equalization factor as an index, so as to adjust the directionof the mean vector and remove projection bias.